Semantic Structure and Interpretability of Word Embeddings

Authors

  • Lutfi Kerem Senel
  • Ihsan Utlu
  • Veysel Yücesoy
  • Aykut Koç
  • Tolga Çukur
Abstract

Dense word embeddings, which encode the semantic meanings of words in low-dimensional vector spaces, have become very popular in natural language processing (NLP) research due to their state-of-the-art performance on many NLP tasks. Word embeddings are substantially successful in capturing semantic relations among words, so a meaningful semantic structure must be present in the respective vector spaces. However, in many cases this semantic structure is broadly and heterogeneously distributed across the embedding dimensions, which makes interpretation of the dimensions a challenge. In this study, we propose a statistical method to uncover the latent semantic structure in dense word embeddings. To perform our analysis, we introduce a new dataset (SEMCAT) that contains more than 6,500 words semantically grouped under 110 categories. We further propose a method to quantify the interpretability of word embeddings; the proposed method is a practical alternative to the classical word intrusion test, which requires human intervention.
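As an illustration of the kind of dimension-level analysis the abstract describes, the sketch below scores each embedding dimension by how strongly it separates the words of one category from the rest of the vocabulary. This is a minimal illustration, not the authors' actual statistical method; the embeddings, the category, and the standardized-gap score are all toy assumptions.

    # Minimal sketch (not the paper's exact method): score each embedding
    # dimension by how strongly it separates one SEMCAT-style category's
    # words from all other words.
    import numpy as np

    def dimension_category_scores(embeddings, category_idx):
        """embeddings: (n_words, n_dims); category_idx: indices of words in
        one semantic category. Returns one score per dimension: the
        standardized gap between within-category and remaining-word means,
        so high-scoring dimensions 'encode' the category more strongly."""
        inside = embeddings[category_idx]                  # category words
        mask = np.ones(len(embeddings), dtype=bool)
        mask[category_idx] = False
        outside = embeddings[mask]                         # all other words
        gap = inside.mean(axis=0) - outside.mean(axis=0)   # per-dim mean gap
        pooled_std = np.sqrt(inside.var(axis=0) / len(inside)
                             + outside.var(axis=0) / len(outside))
        return np.abs(gap) / (pooled_std + 1e-12)

    # Toy usage: 1000 words, 300 dimensions, one 25-word "category".
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(1000, 300))
    cat = rng.choice(1000, size=25, replace=False)
    scores = dimension_category_scores(emb, cat)
    print("most category-selective dimensions:", np.argsort(scores)[-5:])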


Similar articles

Online Learning of Interpretable Word Embeddings

Word embeddings encode the semantic meanings of words into low-dimensional word vectors. In most word embeddings, one cannot interpret the meanings of specific dimensions of those word vectors. Nonnegative matrix factorization (NMF) has been proposed to learn interpretable word embeddings via non-negativity constraints. However, NMF methods suffer from scale and memory issues because they have to mainta...
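The sketch below illustrates the NMF approach this abstract refers to: factoring a nonnegative co-occurrence statistic (here PPMI) so that every embedding coordinate is nonnegative and can be read as the strength of a latent topic. It is a hedged approximation, not the paper's method; the toy count matrix stands in for real corpus statistics, and the batch scikit-learn solver used here holds the full matrix in memory, which is exactly the scaling limitation the abstract mentions.

    # Hedged sketch of NMF-based interpretable embeddings on a toy
    # word-context count matrix (a stand-in for real corpus counts).
    import numpy as np
    from sklearn.decomposition import NMF

    rng = np.random.default_rng(0)
    counts = rng.poisson(1.0, size=(500, 500)).astype(float)

    # Positive PMI: log of observed/expected co-occurrence, clipped at zero.
    total = counts.sum()
    pw = counts.sum(axis=1, keepdims=True) / total
    pc = counts.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore"):
        pmi = np.log((counts / total) / (pw * pc))
    ppmi = np.maximum(pmi, 0.0)

    model = NMF(n_components=50, init="nndsvd", max_iter=200)
    W = model.fit_transform(ppmi)   # (n_words, 50) nonnegative embeddings
    H = model.components_           # (50, n_contexts) nonnegative "topics"
    print(W.shape, (W >= 0).all())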


On the Use of Word Embeddings Alone to Represent Natural Language Sequences

To construct representations for natural language sequences, information from two main sources needs to be captured: (i) semantic meaning of individual words, and (ii) their compositionality. These two types of information are usually represented in the form of word embeddings and compositional functions, respectively. For the latter, Recurrent Neural Networks (RNNs) and Convolutional Neural Ne...
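A minimal sketch of the two ingredients this abstract separates, assuming toy vectors: word embeddings supply per-word meaning, and a compositional function maps the variable-length sequence of vectors to one fixed-size representation. Mean- and max-pooling below are the simplest parameter-free compositional functions; an RNN or CNN would be a learned alternative.

    # Sketch: (i) per-word embeddings, (ii) a compositional function that
    # pools them into a fixed-size sequence representation. All toy data.
    import numpy as np

    rng = np.random.default_rng(0)
    vocab = {w: i for i, w in enumerate("the cat sat on a mat".split())}
    E = rng.normal(size=(len(vocab), 8))   # toy embedding table, d = 8

    def compose(sentence, how="mean"):
        """Look up each word's vector, then pool over the time axis."""
        vecs = E[[vocab[w] for w in sentence.split()]]   # (seq_len, d)
        return vecs.mean(axis=0) if how == "mean" else vecs.max(axis=0)

    print(compose("the cat sat").shape)    # (8,) for any sentence length
    print(compose("a cat sat on a mat", how="max").shape)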


Best of Both Worlds: Making Word Sense Embeddings Interpretable

Word sense embeddings represent a word sense as a low-dimensional numeric vector. While this representation is potentially useful for NLP applications, its interpretability is inherently limited. We propose a simple technique that improves interpretability of sense vectors by mapping them to synsets of a lexical resource. Our experiments with AdaGram sense embeddings and BabelNet synsets show t...
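The sketch below conveys the general idea of mapping sense vectors to synsets, under the assumption that a synset can be represented by the mean embedding of its lemmas and that a sense vector takes the label of the nearest prototype by cosine similarity. The vectors, synsets, and scoring rule are illustrative assumptions, not the paper's procedure or the BabelNet data.

    # Hedged sketch: label each sense vector with its nearest synset,
    # where a synset is represented by the mean embedding of its lemmas.
    import numpy as np

    rng = np.random.default_rng(0)
    word_vec = {w: rng.normal(size=16) for w in
                ["bank", "river", "shore", "money", "loan", "deposit"]}

    # Hypothetical synsets: sets of lemmas assumed to share one meaning.
    synsets = {"bank.financial": ["money", "loan", "deposit"],
               "bank.riverside": ["river", "shore"]}

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    def label_sense(sense_vec):
        """Return the synset whose lemma-average vector is most similar."""
        proto = {s: np.mean([word_vec[w] for w in lemmas], axis=0)
                 for s, lemmas in synsets.items()}
        return max(proto, key=lambda s: cosine(sense_vec, proto[s]))

    # A toy "sense vector" built near the financial lemmas should map to
    # the financial synset, the nearer prototype.
    sense = word_vec["money"] + 0.1 * rng.normal(size=16)
    print(label_sense(sense))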


Interpretable probabilistic embeddings: bridging the gap between topic models and neural networks

We consider probabilistic topic models and more recent word embedding techniques from the perspective of learning hidden semantic representations. Inspired by a striking similarity between the two approaches, we merge them and learn probabilistic embeddings with an online EM algorithm on word co-occurrence data. The resulting embeddings perform on par with Skip-Gram Negative Sampling (SGNS) on word simil...
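To make the topic-model side concrete, the following sketch fits a PLSA-style model p(c | w) = sum_t p(t | w) * p(c | t) to a co-occurrence matrix with batch EM; each row p(. | w) is then a probabilistic word embedding, a distribution over latent topics. The paper uses an online EM variant and a fuller model; this batch toy version only illustrates the embedding-as-distribution idea.

    # Hedged sketch: batch EM for a PLSA-style model on toy co-occurrences.
    import numpy as np

    rng = np.random.default_rng(0)
    n_words, n_ctx, n_topics = 200, 200, 10
    counts = rng.poisson(0.5, size=(n_words, n_ctx)).astype(float)

    # Parameters: theta[w, t] = p(t | w), phi[t, c] = p(c | t).
    theta = rng.dirichlet(np.ones(n_topics), size=n_words)
    phi = rng.dirichlet(np.ones(n_ctx), size=n_topics)

    for _ in range(50):
        # E-step responsibilities r[w, t, c] are proportional to
        # theta[w, t] * phi[t, c]; they are folded directly into the
        # M-step's expected-count updates below.
        denom = theta @ phi                    # (n_words, n_ctx), p(c | w)
        ratio = counts / np.maximum(denom, 1e-12)
        theta_new = theta * (ratio @ phi.T)    # expected counts n(w, t)
        phi_new = phi * (theta.T @ ratio)      # expected counts n(t, c)
        theta = theta_new / theta_new.sum(axis=1, keepdims=True)
        phi = phi_new / phi_new.sum(axis=1, keepdims=True)

    # Each row of theta is a probabilistic embedding: a topic distribution.
    print(theta[0].round(3), theta[0].sum())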


Improve Chinese Word Embeddings by Exploiting Internal Structure

Recently, researchers have demonstrated that both a Chinese word and its component characters provide rich semantic information when learning Chinese word embeddings. However, they ignored the semantic similarity across the component characters within a word. In this paper, we learn the semantic contribution of characters to a word by exploiting the similarity between a word and its component characters ...
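The sketch below captures this abstract's core intuition under toy assumptions: let each component character contribute to its word's representation in proportion to how similar the character vector already is to the word vector. The vectors and the softmax weighting are illustrative; the paper's actual training objective is not reproduced here.

    # Hedged sketch: similarity-weighted character contributions to a word.
    import numpy as np

    rng = np.random.default_rng(0)
    d = 16
    word_vec = rng.normal(size=d)          # toy vector for a two-character word
    char_vecs = rng.normal(size=(2, d))    # toy vectors for its two characters

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # Characters closer in meaning to the whole word get a larger say;
    # softmax keeps the weights positive and normalized.
    sims = np.array([cosine(word_vec, c) for c in char_vecs])
    weights = np.exp(sims) / np.exp(sims).sum()

    # Combined representation: the word vector enriched by its characters.
    combined = 0.5 * word_vec + 0.5 * (weights @ char_vecs)
    print(weights.round(3), combined.shape)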



Journal:
  • CoRR

Volume: abs/1711.00331

Publication year: 2017